4.3 Comparing Two Models with the McNemar Test

The McNemar test, (...), is a non-parametric statistical test for paired comparisons that can be applied to compare the performance of two machine learning classifiers.

(i.e., a non-parametric test on paired outcomes; here the pairs are the two classifiers' predictions on the same test samples)

Also referred to as the "within-subjects chi-squared test"

A 2x2 confusion matrix (contingency table) comparing the predictions of the two models

Figure 19

Model 1 correct / wrong versus Model 2 correct / wrong

Four cells

Model 1 correct, Model 2 correct: A

Model 1 correct, Model 2 wrong: B

Model 1 wrong, Model 2 correct: C

Model 1 wrong, Model 2 wrong: D

Let n be the number of test samples (A+B+C+D=n)

Model 1 accuracy is (A+B)/n

Model 2 accuracy is (A+C)/n
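A minimal sketch of how the four cells could be counted from per-sample predictions. The function name `paired_table` and the toy labels are hypothetical, made up only to illustrate the definitions above.

```python
def paired_table(y_true, pred_1, pred_2):
    """Count the cells (A, B, C, D) of the 2x2 table defined above."""
    a = b = c = d = 0
    for y, p1, p2 in zip(y_true, pred_1, pred_2):
        ok_1, ok_2 = (p1 == y), (p2 == y)
        if ok_1 and ok_2:
            a += 1  # both models correct
        elif ok_1:
            b += 1  # only Model 1 correct
        elif ok_2:
            c += 1  # only Model 2 correct
        else:
            d += 1  # both models wrong
    return a, b, c, d


# Hypothetical toy data (not from the paper)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_1 = [0, 1, 1, 0, 1, 1, 0, 1]
pred_2 = [0, 1, 0, 0, 1, 1, 1, 0]

A, B, C, D = paired_table(y_true, pred_1, pred_2)
n = A + B + C + D
acc_1 = (A + B) / n  # Model 1 accuracy
acc_2 = (A + C) / n  # Model 2 accuracy
print(A, B, C, D, acc_1, acc_2)
```

Only B and C change when the two models disagree, which is why the test looks at those two cells.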

Cells B and C (the off-diagonal entries), however, tell us how the models differ.

(B and C are the off-diagonal cells, i.e., the samples where exactly one of the two models is correct)

Figure 20 (n=10000)

subpanel (scenario) A (note: the label overlaps with the cell letters, which is a little confusing)

Model 1 accuracy: (A+B)/n = (9959+11)/10000 = 0.997

Model 2 accuracy: (A+C)/n = (9959+1)/10000 = 0.996

subpanel (scenario) B

Model 1 accuracy: (A+B)/n = (9945+25)/10000 = 0.997

Model 2 accuracy: (A+C)/n = (9945+15)/10000 = 0.996

The models' accuracies are the same in both subpanels
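The arithmetic for the two subpanels can be checked with a short sketch. The counts for A, B, and C are the ones quoted above from Figure 20; cell D is not quoted in these notes, so it is derived here from n = 10000 as an assumption.

```python
N = 10_000  # total number of test samples in Figure 20

# Per scenario: (cell A, cell B, cell C) as quoted in the notes
scenarios = {
    "A": (9959, 11, 1),
    "B": (9945, 25, 15),
}

accuracies = {}
for name, (a, b, c) in scenarios.items():
    d = N - (a + b + c)   # cell D (both wrong), derived from n
    acc_1 = (a + b) / N   # Model 1 accuracy
    acc_2 = (a + c) / N   # Model 2 accuracy
    accuracies[name] = (acc_1, acc_2)
    print(f"scenario {name}: D={d}, acc1={acc_1:.3f}, acc2={acc_2:.3f}, B:C={b}:{c}")
```

Both scenarios give the same pair of accuracies, so accuracy alone cannot separate them; only the B:C split differs.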

In subpanel A, Model 1 is correct and Model 2 wrong for 11 samples, while Model 1 is wrong and Model 2 correct for 1 sample

based on this 11:1 ratio, we may conclude, based on our intuition, that Model 1 performs substantially better than Model 2.

In subpanel B the ratio is 25:15, so the same conclusion is harder to draw than in subpanel A

McNemar's test

we formulate the null hypothesis that the probabilities p(B) and p(C) – where B and C refer to the confusion matrix cells introduced in an earlier figure – are the same

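Null hypothesis: the probabilities p(B) and p(C) are equal

In other words, neither of the two models performs better than the other

Alternative hypothesis: the performances of the two models are not equal

Compute the chi-squared statistic

(TODO: check the chi-squared test in a statistics textbook)

Set a significance threshold (e.g., α=0.05), then compute the p-value

Assuming the null hypothesis is true, the p-value is the probability of observing the given chi-squared value or a larger one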

If the p-value is lower than our chosen significance level, we can reject the null hypothesis

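The statistic approximately follows a chi-squared distribution with 1 degree of freedom (see also 4.6 Cochran’s Q Test for Comparing the Performance of Multiple Classifiers)

In scenario B, the chi-squared value is 2.5

p-value = 0.1138 > α = 0.05, so the null hypothesis is not rejected

In scenario A, the chi-squared value is 8.3

p-value = 0.0039 < α = 0.05, so the null hypothesis is rejected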
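The values above can be reproduced with a small sketch, assuming the uncorrected statistic χ² = (B-C)²/(B+C), which is the form that matches the quoted 2.5 and 8.3. The helper names are hypothetical. For 1 degree of freedom, the upper-tail probability of the chi-squared distribution equals erfc(sqrt(x/2)), so the standard library is enough here.

```python
import math


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic from the off-diagonal cells B and C."""
    return (b - c) ** 2 / (b + c)


def chi2_sf_1df(x):
    """P(X >= x) for a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


ALPHA = 0.05
results = {}
for name, (b, c) in {"A": (11, 1), "B": (25, 15)}.items():
    chi2 = mcnemar_chi2(b, c)
    p = chi2_sf_1df(chi2)
    results[name] = (chi2, p)
    verdict = "reject H0" if p < ALPHA else "do not reject H0"
    print(f"scenario {name}: chi2={chi2:.1f}, p={p:.4f} -> {verdict}")
```

Note that cells A and D never enter the computation; the test depends only on the samples where the models disagree.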

we can conclude that the models’ performances are different (for instance, Model 1 performs better than Model 2).

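Continuity-corrected version of McNemar's test

The numerator of the chi-squared formula is changed from (B-C)**2 to (|B-C|-1)**2

(Thought: is it corrected so that the statistic is smallest when cells B and C differ by one?)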
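A quick sketch to probe the corrected numerator, assuming the denominator stays B+C as in the uncorrected statistic. The function name is hypothetical, and the small (3, 2) and (5, 5) inputs are made-up values chosen only to test the thought above.

```python
import math


def mcnemar_chi2_corrected(b, c):
    """Continuity-corrected statistic: (|B - C| - 1)**2 / (B + C)."""
    return (abs(b - c) - 1) ** 2 / (b + c)


def chi2_sf_1df(x):
    """P(X >= x) for a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


chi2_a = mcnemar_chi2_corrected(11, 1)    # scenario A: 81/12
chi2_b = mcnemar_chi2_corrected(25, 15)   # scenario B: 81/40
p_a = chi2_sf_1df(chi2_a)
p_b = chi2_sf_1df(chi2_b)

# Made-up inputs: cells differing by one, and equal cells
chi2_diff_one = mcnemar_chi2_corrected(3, 2)
chi2_equal = mcnemar_chi2_corrected(5, 5)

print(chi2_a, chi2_b, chi2_diff_one, chi2_equal)
```

With this formula the statistic is exactly 0 when B and C differ by one, and it is slightly positive when B = C, so the correction shrinks the statistic for every case except equal cells.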

Implemented in mlxtend: mcnemar: McNemar's test for classifier comparisons

McNemar's test approximates the p-value well when the counts in cells B and C are larger than 50 (see 4.4)